Which chemical properties influence the quality of red wines? In this project we’ll try to answer this question by exploring the red wine data set.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity  volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar      chlorides      free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide      density            pH           sulphates
##  Min.   :  6.00         Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00         1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00         Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47         Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00         3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00         Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
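Output of this kind is what R's `str()` and `summary()` print for a data frame. The following is only a minimal sketch: `wine_head` is a hypothetical stand-in built from the first ten values of three columns shown in the `str()` output, since the name and location of the full data file are not given here.

```r
# Stand-in data: first ten values of three columns, copied from the str() output.
# The full data set has 1599 rows; this subset is for illustration only.
wine_head <- data.frame(
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and leading values
summary(wine_head)  # min, quartiles, median, mean and max per column
```

Run on the full data frame, the same two calls should give the structure and the six-number summaries listed above.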
Some initial observations here:
quality is an ordered, categorical, discrete variable. The median rating is 6 on a 10-point scale, and 75% of wines are rated 6 or below.
density appears to have very little variance, while residual.sugar and chlorides vary much more.
The minimum of citric.acid is 0.
Now let’s look at the distributions of the variables.
Some observations on these:
volatile.acidity, density and pH look nearly normal.
residual.sugar and chlorides have extremely long tails.
citric.acid appears to have a large number of zero values.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
There is a high concentration of residual sugar values around 2.2 (the median), with some outliers in the higher ranges.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
We see a similar distribution with chlorides. It peaks at around 0.079 (the median).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Number of zero-values:
## [1] 132
This is a really strange distribution. About 8% of the wines (132 of 1599) contain no citric acid at all.
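The zero count and its share can be obtained with a logical comparison. A small sketch, assuming a vector `citric_acid` (a hypothetical name) that here holds only the first ten citric.acid values from the `str()` output; the last lines simply redo the 132/1599 arithmetic with the reported figures.

```r
# First ten citric.acid values from the str() output (illustration only).
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_zero     <- sum(citric_acid == 0)   # number of zero values in this subset
share_zero <- mean(citric_acid == 0)  # proportion of zero values in this subset

# Share implied by the figures reported for the full data set.
reported_share <- 132 / 1599
round(100 * reported_share, 1)        # about 8.3 (percent)
```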
There are 1599 observations of 13 variables in the red_wine data set.
I’m most interested in quality and how the other variables affect it. Quality is scored between 0 and 10, but the observations only range from 3 to 8, with a mean of 5.636.
I won’t be sure until I look at correlations between variables and some bivariate plots, but volatile.acidity, citric.acid and alcohol seem to be the features most related to the taste of wine.
Not yet.
Some variables, like residual.sugar and chlorides, have long-tailed distributions. I also noticed that 8% of citric.acid values are zero.
I haven’t performed any operations yet.
Quantitatively, the following variables have relatively strong correlation with quality:
Strong correlations between other variables:
Let’s see more details.
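Correlations with quality can be ranked with `cor()`. This is a sketch under assumptions: `wine_head` is a hypothetical ten-row stand-in copied from the `str()` output, and `cor()` defaults to Pearson, which may or may not be the method behind the coefficients quoted in this report. The values it gives on ten rows will not match the full-data coefficients.

```r
# Stand-in data: first ten values of four columns from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlation of every column with quality (Pearson by default).
cors <- cor(wine_head)[, "quality"]
cors <- cors[names(cors) != "quality"]  # drop quality's correlation with itself

# Rank features by the absolute size of the correlation, strongest first.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```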
Among all features alcohol has the strongest correlation with red wine quality (0.476).
The wines rated as 3 all have alcohol values less than or equal to 11%, while roughly 75% of wines rated as 7 or 8 have alcohol values greater than 11%.
With all six quality levels, the plots start looking messy. I created a categorical variable rating, classifying the wines as low (rating 0 to 4), medium (rating 5 and 6), and high (rating 7 to 10).
## low medium high
## 63 1319 217
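One way to build such a `rating` variable is base R's `cut()`. The sketch below is an assumption about how the binning could be done, not necessarily the call used for the table above, and it uses only the first ten quality scores from the `str()` output.

```r
# First ten quality scores from the str() output (illustration only).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Bins: [0, 4] -> low, (4, 6] -> medium, (6, 10] -> high.
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE,   # so that a score of 0 falls into "low"
  ordered_result = TRUE    # keep the low < medium < high ordering
)

table(rating)  # counts per rating level
```

Making the factor ordered keeps the levels in a sensible order in tables and plot legends.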
We see that low and medium quality wines become less common as alcohol levels increase. We also see that at higher alcohol levels, there are more high quality wines.
There is a clear positive relationship between alcohol and quality. It makes sense, since higher alcohol content would be related to a higher concentration of flavor. Wines with lower alcohol content would likely have a more “watery” mouthfeel and might not be perceived as being of high quality.
Volatile acidity has the second strongest correlation with wine quality, and it is negative (-0.391).
I added jitter and transparency to prevent overplotting. It definitely looks like there is a negative correlation between the two.
The trend is very clear: the lower the volatile acidity level, the higher the wine quality. This makes sense, since too much volatile acidity can lead to an unpleasant, vinegar-like taste.
Now let’s look at fixed acidity, which has a much weaker correlation with wine quality (0.12).
As expected, the relationship is not as obvious as it is between volatile acidity and quality. How about TA (total acidity), the combination of fixed acidity and volatile acidity?
Well, maybe there is a trend, but it is still not as clear as for volatile acidity. This is not a surprise, since the taste of wine is much more complex. Different types of acid affect how we perceive it. For example, during the ageing of Chardonnay, malic acid gradually converts to lactic acid, and the sharp acidic taste becomes smoother.
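The TA variable discussed above could be derived as a simple sum. This is an assumption: the text only calls it a "combination" of the two acidities. The vector names are hypothetical and the values are the first ten rows from the `str()` output.

```r
# First ten values from the str() output (illustration only).
fixed_acidity    <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed definition: total acidity as the sum of fixed and volatile acidity.
total_acidity <- fixed_acidity + volatile_acidity

# Correlation of the derived variable with quality (Pearson by default).
cor(total_acidity, quality)
```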
The feature with the third strongest correlation with quality is sulphates (0.25). The coefficient is modest, but let’s have a look anyway.
Here again I added jitter and some transparency to prevent overplotting. There does appear to be a trend toward higher sulphate levels in higher rated wines. But there also are a large number of outliers for the wines rated as 5 or 6.
There is a long tail! Maybe we should try to take a log.
It’s much better. Let’s take a look at the correlation.
## cor
## 0.3086419
This is higher than the previous 0.25, which makes the variable more meaningful for wine quality.
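The comparison between the raw and the log-transformed correlation can be sketched as follows. The vectors are hypothetical stand-ins holding the first ten rows from the `str()` output, so the coefficients they produce will differ from the full-data values quoted here. The base of the logarithm is not stated in the text; for a Pearson correlation it does not matter, because changing the base only rescales the variable.

```r
# First ten values from the str() output (illustration only).
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

cor_raw <- cor(sulphates, quality)         # correlation on the original scale
cor_log <- cor(log10(sulphates), quality)  # correlation after a log10 transform

# Natural log gives the same Pearson correlation as log10.
all.equal(cor_log, cor(log(sulphates), quality))

c(raw = cor_raw, log = cor_log)
```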
Now let’s look at citric acid and quality, which have a correlation coefficient of 0.23. That is not very strong either.
There is a large amount of variance in these values, but I can see a positive trend: the median citric acid value increases steadily with each successive quality rating, from 0.035 g/dm3 for wines rated as 3 up to 0.420 g/dm3 for wines rated as 8.
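Medians per quality rating can be computed with `tapply()`. A minimal sketch on the first ten rows from the `str()` output, with hypothetical vector names; only ratings 5, 6 and 7 occur in this subset, so it will not reproduce the medians quoted above.

```r
# First ten values from the str() output (illustration only).
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
quality     <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median citric acid within each quality rating.
medians <- tapply(citric_acid, quality, median)
medians
```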
We see that many wines have a low citric acid concentration, including highly rated ones. This is consistent with our earlier finding that 8% of the wines contain no citric acid at all. In contrast to volatile acidity, citric acid adds freshness to the wine, but I don’t think it is a necessary feature of a quality wine.
Here, we’ll take a look at pH, which has the weakest correlation with quality (0.028).
Does this mean that pH level is meaningless for wine quality?
I don’t think so. With an appropriate pH level, the wine will show a better color and the growth of bacteria will be kept under control. Together with TA (total acidity), pH also lets us make an initial judgment about the taste and style of a wine. This feature is so important that every winemaker pays attention to it.
Our samples also contain many more average wines than excellent or poor ones. We can see from the plot that most wines have a pH level between 3.2 and 3.4, which is already an appropriate range for red wines.
Finally, I’d like to look at quality and residual sugar plotted against each other. They have the second weakest correlation (0.031).
Wow, it has such a small amount of variance! But it does make sense. Based on sweetness, wine can be categorised into several types: dry, medium, sweet and so on. A wine of any type can be good or bad, so this variable does not seem to be a useful measure of wine quality.
The following 4 combinations have strongest overall correlations in the data set.
Some correlations are positive, some are negative. For me, these are all reasonable relationships.
For the main feature of interest in the data set, quality has relatively strong correlations with 3 of the features: alcohol, volatile.acidity and log(sulphates).
alcohol has the strongest correlation with red wine quality (0.476). It shows a clear and positive correlation between the two in the plots. Other than a slight dip for wines rated as a 5, the median values of alcohol steadily increased with each rating.
volatile.acidity has a negative correlation with red wine quality (-0.391). The variance decreased with each increase in rating.
Like alcohol, sulphates has a positive correlation with quality (0.251). But there are also a large number of outliers for the wines rated as 5 or 6. By applying log scale, the correlation coefficient is increased to 0.309.
fixed.acidity has relatively strong relationships with several features, such as pH, citric.acid and density.
The strongest relationship is easy to guess: pH and fixed.acidity.
Now let’s look at the two variables with the strongest correlations with quality plotted against each other and colored by quality.
From this plot we see that, in general, wines with higher alcohol content and lower volatile acidity tend to be rated higher.
Next, we’ll create a similar plot to examine volatile acidity and sulphates, colored by quality.
We see that more sulphates combined with lower volatile acidity tends to go with better wines. Compared with low and medium quality wines (rated as 3 to 6), this trend is less obvious in high quality wines (rated as 7 or 8).
From this plot we can see that higher alcohol content combined with higher sulphates concentration tends to go with higher quality wines.
Let’s have a look at the combination of pH, fixed.acidity and citric.acid. These features are involved in the three strongest correlations among all features.
This is a much more typical linear relationship. The trend is clear: the lower the pH level, the higher the fixed acidity, and also the higher the citric acid concentration.
Most of the relationships from this part of the analysis are consistent with what is seen in the earlier sections.
It looks like a very low sulphates concentration almost completely prevents a wine from achieving a high quality rating. On the other hand, there are some highly rated wines with very low alcohol content, and even some with slightly high volatile acidity.
I didn’t, because none of the relationships seems strong enough to support a model.
For our samples, the effect of pH and fixed acidity on wine quality was very slight. On the other hand, volatile acidity and citric acid had relatively strong correlations with wine quality. As the volatile acidity increased, the wine quality tended to be lower. As the citric acid increased, the quality tended to be higher.
For our samples, alcohol had the strongest correlation with quality (0.476). As the alcohol content increased, the quality of the wine tended to increase as well. The wines rated as 3 all had alcohol content less than or equal to 11%, while roughly 75% of the high quality wines (rated as 7 or 8) had alcohol content greater than 11%.
With medium quality wines removed from the data, we see a clearer pattern: highly rated wines are concentrated in the area of higher alcohol content and lower volatile acidity. In other words, the combination of high alcohol content and low volatile acidity tended to go with better wines.
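Removing the medium wines before plotting amounts to a filter on the `rating` variable. The sketch below is one possible way to do it, with a hypothetical ten-row stand-in `wine_head` copied from the `str()` output and a hypothetical result name `extremes`; the `cut()` binning is the same assumption as for the low/medium/high split described earlier.

```r
# Stand-in data: first ten values of three columns from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed binning: 0-4 low, 5-6 medium, 7-10 high.
wine_head$rating <- cut(
  wine_head$quality,
  breaks = c(0, 4, 6, 10),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Keep only low and high wines, then drop the now-empty factor levels
# so they do not show up in tables or plot legends.
extremes <- droplevels(subset(wine_head, rating != "medium"))
extremes
```

In this ten-row subset only two high-rated wines remain; on the full data the same filter would keep the 63 low and 217 high wines counted earlier.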
The red wine data set contains 1,599 observations with 11 variables on the chemical properties, and it was provided in a clean format, without any missing data. My goal was to find out which chemical properties influence the quality of red wines.
I started by examining each of the features to get a feel for the data set and the ranges of values. I found that most features had skewed distributions with long tails. I also noticed the high concentration of wines in the middle of the rating range, which means our samples contain many more average wines than excellent or poor ones. This troubled me, since I could not figure out whether the long tails in the distributions were outliers or just the result of uneven sampling. I did some research and came to realize that in the real world there are far more average wines, so this is to be expected. We should regard these long tails as outliers, because they do not help in identifying the pattern.
I decided to explore the relationships between features. Unsurprisingly, there was not a single strong correlation between quality and the other features, but some of them did seem to be more influential than others. This makes sense, since wine quality is much more complex than, say, diamond price, which is dominated by size or carat.
Most of my visualization in this project was done on the 4 features that have the highest correlation coefficients with quality: alcohol (0.476), volatile.acidity (-0.391), sulphates (0.251) and citric.acid (0.226). I also explored the features with the weakest correlations with quality, pH (-0.058) and residual.sugar (0.014), and tried to understand the reason.
During the exploration, plots started looking messy with so many quality scores. So I created a categorical variable rating, classifying the wines as low (rating 0 to 4), medium (rating 5 and 6), and high (rating 7 to 10).
In the end, with medium quality wines removed from the final visualization, I could see a clearer pattern: highly rated wines are concentrated in the area of higher alcohol content and lower volatile acidity. In other words, the combination of high alcohol content and low volatile acidity tended to go with better wines.
As for improvements, I think the data set is fairly limited with only 11 chemical properties. It would be great if other variables, such as grape type and wine age, could be included for further investigation.